{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# pandas 进阶修炼 ｜早起Python\n",
    "<br>\n",
    "\n",
    "**本习题由公众号【早起Python & 可视化图鉴】 原创，转载及其他形式合作请与我们联系（微信号`sshs321`)，未经授权严禁搬运及二次创作，侵权必究！**\n",
    "\n",
    "\n",
    "\n",
    "本习题基于 `pandas` 版本 `1.1.3`，所有内容应当在 `Jupyter Notebook` 中执行以获得最佳效果。\n",
    "\n",
    "不同版本之间写法可能会有少许不同，如若碰到此情况，你应该学会如何自行检索解决。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3 - 数据预览与预处理\n",
    "<br>\n",
    "\n",
    "在拿到数据第一步当然是对数据做一个大概的浏览，以及对缺失值重复值进行相关处理。\n",
    "\n",
    "本小节就将练习这部分的基本操作。\n",
    "\n",
    "注意1：为了尽可能多的介绍不同方法，因此本节部分操作不是必须的。\n",
    "\n",
    "注意2:当进入到 pandas 操作数据时，很多答案并非唯一也并非最优。\n",
    " "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 初始化\n",
    "\n",
    "<br>\n",
    "\n",
    "该 `Notebook` 版本为**纯习题版**\n",
    "\n",
    "如果需要答案或者提示，可以微信搜索公众号「早起Python」获取！"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 加载数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "df = pd.read_excel(\"TOP250.xlsx\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 数据查看"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1 - 查看数据维度\n",
    "\n",
    "先看看数据多少行，多少列，对接下来的处理量心里有个数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2 - 随机查看5条数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3 - 查看数据前后5行"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4 - 查看数据基本信息\n",
    "\n",
    "看看数据类型，有无缺失值什么的"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5 - 查看数据统计信息｜数值\n",
    "\n",
    "查看 **数值型** 列的统计信息，计数、均值什么的"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6 - 查看数据统计信息｜离散\n",
    "\n",
    "查看 **离散型** 列的统计信息，计数、频率什么"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 7 - 查看数据统计信息｜整体\n",
    "\n",
    "查看 **全部** 列的统计信息"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 缺失值处理"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 8 - 计算缺失值｜总计"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "通过上面的查看，我们发现部分列是存在缺失值的，那么先看看一共存在多少个缺失值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 9 - 计算缺失值｜分列\n",
    "\n",
    "再看看具体每列有多少缺失值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 10 - 查看缺失值\n",
    "\n",
    "</br>\n",
    "\n",
    "为了后面更方便的处理缺失值，现在先看看全部缺失值所在的行"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "微信搜索公众号「早起Python」，关注后可以获得更多资源！"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 11- 高亮缺失值"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "很明显，虽然上一题找到了全部缺失值所在的行，但是看的很头疼\n",
    "\n",
    "-> 现在将缺失值进行高亮进一步查看\n",
    "\n",
    "指路：<font color = '#E36C07'>**2-15**</font>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 12 - 删除缺失值\n",
    "\n",
    "<br>\n",
    "\n",
    "处理缺失值最简单的方式，当然是将缺失值出现的行全部删掉～\n",
    "\n",
    "-> 现在，将缺失值出现的行全部删掉"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 13 - 缺失值补全｜整体填充\n",
    "\n",
    "<br>\n",
    "\n",
    "除了删除缺失值最省事之外，将全部缺失值替换为一个 **固定的值/文本** 也是一个较为省事的方法'\n",
    "\n",
    "-> 现在，将全部缺失值替换为 `*`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 14 - 缺失值补全｜向上填充"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "从上一小节的查看数据中，不难发现整理数据是按照评分进行降序排列的\n",
    "\n",
    "因此对于评分列的缺失值处理，我们可以用上一个电影的评分进行填充\n",
    "\n",
    "-> 现在将评分列的缺失值，替换为上一个电影的评分"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 15 - 缺失值补全｜整体均值填充"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "对于评价人数列的缺失值处理，我们可以使用整列的均值进行填充\n",
    "\n",
    "-> 现在，将评价人数列的缺失值，用整列的均值进行填充"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 16 - 缺失值补全｜上下均值填充\n",
    "\n",
    "<br>\n",
    "\n",
    "除了可以使用整列的均值进行填充，也可以使用缺失值位置的上下均值进行填充、\n",
    "\n",
    "-> 现在，将评价人数列的缺失值，用上下数字的均值进行填充"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 17 -缺失值补全｜匹配填充\n",
    "\n",
    "<br>\n",
    "\n",
    "除了利用均值填充，有时还需要根据另一列的值进行匹配填充\n",
    "\n",
    "-> 现在填充 “语言” 列的缺失值，要求根据 “国家/地区” 列的值进行填充\n",
    "\n",
    "> 例如 《海上钢琴师》国家/地区为 意大利，根据其他意大利国家对应的语言来看，应填充为 意大利语\n",
    "\n",
    "**注意：此题会有多种解法**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 重复值处理"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 18 - 查找重复值\n",
    "\n",
    "<br>\n",
    "\n",
    "将全部缺失值所在的行筛选出来\n",
    "\n",
    "**注意：此题会有多种解法**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 19 -查找重复值｜指定\n",
    "\n",
    "上面是所有列完全重复的情况，但有时我们只需要根据某列查找缺失值\n",
    "\n",
    "-> 查找 片名 列全部缺失值\n",
    "\n",
    "**注意：此题会有多种解法**\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 20 - 删除重复值"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "删除全部的重复值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 21 - 删除重复值｜指定"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "删除全部的重复值，但保留最后一次出现的值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 142,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](http://liuzaoqi.oss-cn-beijing.aliyuncs.com/2021/09/16/16317972442543.jpg?域名/sample.jpg?x-oss-process=style/stylename)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "384px"
   },
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
